[SPARK-53355][PYTHON][SQL] test python udf type behavior #52105

benrobby · 2025-08-22T22:07:37Z

What changes were proposed in this pull request?

This captures Python UDF return type coercion behavior in a new test suite. It covers:
- udf (script from [SPARK-52952][PYTHON] Add PySpark UDF Type Coercion Dev Script #51663)
- pandas_udf (script from [SPARK-52943][PYTHON] Enable arrow_cast for all pandas UDF eval types #51635)
Instead of manually regenerating results, we now capture the behavior in automated tests, which will allow us to catch type issues earlier. The tests also add the expected results as golden files.
In addition to the return type coercion tests above, this also adds a new test for the input type behavior of UDFs. It documents the Python data types that UDFs receive.
It also adds a helper script that prints a readable diff between the tables.
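To make the golden-file idea concrete, here is a minimal sketch of how a rendered result table could be compared against a stored golden file, with a unified diff printed on mismatch. It is not the helper added in this PR: `render_table`, `diff_against_golden`, the file name, and the placeholder rows are all hypothetical, and the real tables are produced by running UDFs on Spark.

```python
import difflib
import tempfile
from pathlib import Path


def render_table(rows):
    """Render (sql_type, python_value, result) rows as a fixed-width text table."""
    header = ("SQL type", "Python value", "Result")
    cells = [header] + [tuple(str(c) for c in row) for row in rows]
    widths = [max(len(r[i]) for r in cells) for i in range(3)]
    lines = [
        "| " + " | ".join(c.ljust(w) for c, w in zip(r, widths)) + " |"
        for r in cells
    ]
    return "\n".join(lines) + "\n"


def diff_against_golden(actual, golden_path):
    """Return a unified diff, or an empty string if the golden file matches."""
    expected = golden_path.read_text()
    return "".join(
        difflib.unified_diff(
            expected.splitlines(keepends=True),
            actual.splitlines(keepends=True),
            fromfile="golden",
            tofile="actual",
        )
    )


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical golden file name; placeholder rows stand in for real UDF results.
    golden = Path(tmp) / "golden_udf_return_types.txt"
    baseline = [("type_a", "value_1", "result_x"), ("type_a", "value_2", "result_y")]
    golden.write_text(render_table(baseline))

    # Same table: no diff. Changed result cell: the diff names the new value.
    unchanged = diff_against_golden(render_table(baseline), golden)
    changed_rows = [("type_a", "value_1", "result_x"), ("type_a", "value_2", "result_changed")]
    changed = diff_against_golden(render_table(changed_rows), golden)
    print(changed)
```

A test built this way fails with the diff as its message, so an inadvertent coercion change shows up as a small, reviewable change to the golden file.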

Why are the changes needed?

It's too easy to miss inadvertent type coercion changes, and some of the existing tables in the codebase have become inconsistent.

Does this PR introduce any user-facing change?

No

How was this patch tested?

This is a unit test.

Was this patch authored or co-authored using generative AI tooling?

Yes, used Claude to generate the table-printing utils. Generated-by: claude-sonnet-4

benrobby · 2025-08-23T00:48:21Z

@HyukjinKwon and @asl could you take a look?

asl3

cc @zhengruifeng @HyukjinKwon

asl3 · 2025-08-23T07:15:18Z

python/pyspark/sql/tests/udf_type_tests/test_udf_input_types.py

+
+        return [
+            ("byte_values", ByteType(), df([(-128,), (127,), (0,)])),
+            ("byte_null", ByteType(), df([(None,), (42,)])),


Should we add a comment with context about why null handling matters for input types vs. return types?

Would you have an example? What comment are you looking for exactly?

Optional nit, but I meant we could clarify the context for the return_types and input_types.

HyukjinKwon · 2025-08-24T23:35:57Z

python/pyspark/sql/tests/udf_type_tests/README.md

@@ -0,0 +1,5 @@
+These tests capture input/output type interfaces between python udfs and the engine.


Do we want to run this in CI? If so, we should add __init__.py and add this module to dev/sparktestsupport/modules.py.

Yes, let me try to add it. Do we need any other reviewer for the infra changes?

HyukjinKwon · 2025-08-24T23:37:16Z

My only concern is that I don't think we should say these are the standard behaviours, as some of the behaviours are weird. I am fine with this change as long as we're all on the same page that some behaviours here might change in the future.

benrobby · 2025-08-25T08:42:23Z

Yes, agreed. I've also updated the README to reflect that these tests are purely internal and not API documentation.

zhengruifeng · 2025-08-26T02:30:09Z

dev/sparktestsupport/modules.py

    ],
 )

+pyspark_types = Module(


I am hesitant about whether adding a new testing module is too much.
Maybe we can just add these tests to pyspark-sql?

I don't have a strong preference. @HyukjinKwon ?

pyspark-sql alone should be fine.

Alright, I moved the test to run under pyspark-sql.
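As a rough illustration of what running the test under pyspark-sql amounts to, the sketch below registers the test file quoted earlier in this review as a test goal of a `pyspark-sql` module. The `Module` dataclass is a simplified stand-in written for this example, not the class in dev/sparktestsupport/modules.py, and its field names are assumptions. The goal name is derived from the file path with the hypothetical helper `goal_from_path`.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Simplified, hypothetical stand-in for a build-script module definition."""

    name: str
    source_file_regexes: list = field(default_factory=list)
    python_test_goals: list = field(default_factory=list)


def goal_from_path(path):
    """Turn a test file path under python/ into a dotted test goal name."""
    assert path.startswith("python/") and path.endswith(".py")
    return path[len("python/"):-len(".py")].replace("/", ".")


pyspark_sql = Module(
    name="pyspark-sql",
    source_file_regexes=["python/pyspark/sql"],
    python_test_goals=[
        # Existing goals are omitted; only the path quoted in this review is listed.
        goal_from_path("python/pyspark/sql/tests/udf_type_tests/test_udf_input_types.py"),
    ],
)

print(pyspark_sql.python_test_goals)
```

Reusing an existing module avoids a separate CI entry for the new tests, which is the trade-off discussed in this thread.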

benrobby · 2025-08-28T12:04:32Z

@HyukjinKwon this one is ready from my side :)

HyukjinKwon · 2025-08-28T23:17:09Z

Merged to master.

zhengruifeng · 2025-09-04T00:10:46Z

@benrobby the new test fails in two scheduled jobs:
https://github.com/apache/spark/actions/runs/17398781217/job/49386969447
https://github.com/apache/spark/actions/runs/17429965634/job/49486072652

Would you mind taking a look? If it is acceptable, we can skip it in the two envs.
You can check the env in the step List Python Packages.

benrobby · 2025-09-05T08:19:09Z

Thanks for flagging. This is caused by older numpy versions implementing `__repr__` / `__str__` differently, so it's not an actual type change, just a test-only issue. I'll adjust the test to account for this and will ping you once it's ready.
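For context, NumPy 2.x scalar reprs include the type (for example `np.float64(1.0)`), while NumPy 1.x prints the bare value (`1.0`), so a golden file generated under one version will not match output from the other. One possible way to make the comparison version-independent is to normalize the text before diffing. The sketch below is an assumption about how that could look, not the change made in the follow-up PR. `normalize_numpy_repr` is a hypothetical helper that works on strings only and does not handle nested parentheses.

```python
import re

# Matches NumPy 2.x style scalar reprs such as "np.float64(1.0)" or "np.int64(3)".
_NP2_SCALAR = re.compile(r"\bnp\.\w+\(([^()]*)\)")


def normalize_numpy_repr(text):
    """Rewrite NumPy 2.x scalar reprs in `text` to the bare NumPy 1.x form."""
    text = _NP2_SCALAR.sub(r"\1", text)
    # Booleans have no parentheses in the NumPy 2.x repr.
    return text.replace("np.True_", "True").replace("np.False_", "False")


print(normalize_numpy_repr("[np.int64(1), np.float64(2.5), np.True_]"))
```

Applying the same normalization to both the golden file and the actual output keeps the comparison stable across the two NumPy major versions.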

benrobby · 2025-09-05T09:00:32Z

@zhengruifeng here's the fixup pr #52247 :)

### What changes were proposed in this pull request?
- this is a minor followup to #52105, we noticed that the test breaks in two spark master runs with a different env
- the root cause was that numpy 1.x implements `__repr__` differently

### Why are the changes needed?
- fix `Build / Python-only (master, Minimum dependencies of PySpark)`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
ran tests locally with numpy 1.22.4

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #52247 from benrobby/SPARK-53355-fix-numpy-repr.

Authored-by: Ben Hurdelhey <ben.hurdelhey@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>

github-actions bot added SQL PYTHON labels Aug 22, 2025

[SPARK-53355] document python udf type behavior

7e71571

benrobby force-pushed the SPARK-53355 branch from 94111e5 to 7e71571 Compare August 22, 2025 22:09

benrobby changed the title ~~[SPARK-53355][PYTHON] document python udf type behavior~~ [SPARK-53355][PYTHON][SQL] document python udf type behavior Aug 23, 2025

add input tests

facb61f

benrobby force-pushed the SPARK-53355 branch from 85c8922 to facb61f Compare August 23, 2025 00:45

github-actions bot added the DOCS label Aug 23, 2025

asl3 reviewed Aug 23, 2025

zhengruifeng requested a review from HyukjinKwon August 24, 2025 02:00

HyukjinKwon changed the title ~~[SPARK-53355][PYTHON][SQL] document python udf type behavior~~ [SPARK-53355][PYTHON][SQL] Document python udf type behavior Aug 24, 2025

HyukjinKwon reviewed Aug 24, 2025

run in ci

1ad29e4

github-actions bot added BUILD INFRA labels Aug 25, 2025

benrobby changed the title ~~[SPARK-53355][PYTHON][SQL] Document python udf type behavior~~ [SPARK-53355][PYTHON][SQL][INFRA] Document python udf type behavior Aug 25, 2025

benrobby changed the title ~~[SPARK-53355][PYTHON][SQL][INFRA] Document python udf type behavior~~ [SPARK-53355][PYTHON][SQL][INFRA] test python udf type behavior Aug 25, 2025

format

80f6970

benrobby requested review from asl3 and HyukjinKwon August 25, 2025 09:23

HyukjinKwon approved these changes Aug 25, 2025

benhurdelhey added 2 commits August 25, 2025 12:23

add header

d8c84ed

fix golden file

b965db6

asl3 approved these changes Aug 25, 2025

zhengruifeng reviewed Aug 26, 2025

zhengruifeng approved these changes Aug 26, 2025

lint, improve readme

ec96128

move to pyspark-sql test, fix lint

2735b82

github-actions bot removed the INFRA label Aug 26, 2025

fix tests

403c202

benrobby changed the title ~~[SPARK-53355][PYTHON][SQL][INFRA] test python udf type behavior~~ [SPARK-53355][PYTHON][SQL] test python udf type behavior Aug 27, 2025

benhurdelhey added 2 commits August 27, 2025 20:17

format

f879bb1

lint

21529ef

xinrong-meng approved these changes Aug 28, 2025

HyukjinKwon closed this in 7b8877f Aug 28, 2025

benrobby mentioned this pull request Sep 5, 2025

[SPARK-53355][PYTHON][SQL] fix numpy 1.x repr in type tests #52247

Closed
